<!--begin.rcode results='hide', echo=FALSE, message=FALSE

library(caret)
data(BloodBrain)
theme1 <- caretTheme()
theme1$superpose.symbol$col = c(rgb(1, 0, 0, .4), rgb(0, 0, 1, .4), 
  rgb(0.3984375, 0.7578125,0.6445312, .6))
theme1$superpose.symbol$pch = c(15, 16, 17)
theme1$superpose.cex = .8
theme1$superpose.line$col = c(rgb(1, 0, 0, .9), rgb(0, 0, 1, .9), rgb(0.3984375, 0.7578125,0.6445312, .6))
theme1$superpose.line$lwd <- 2
theme1$superpose.line$lty = 1:3
theme1$plot.symbol$col = c(rgb(.2, .2, .2, .4))
theme1$plot.symbol$pch = 16
theme1$plot.cex = .8
theme1$plot.line$col = c(rgb(1, 0, 0, .7))
theme1$plot.line$lwd <- 2
theme1$plot.line$lty = 1

    hook_inline = knit_hooks$get('inline')
    knit_hooks$set(inline = function(x) {
      if (is.character(x)) highr::hi_html(x) else hook_inline(x)
      })
    opts_chunk$set(comment=NA, digits = 3)

session <- paste(format(Sys.time(), "%a %b %d %Y"),
                 "using caret version",
                 packageDescription("caret")$Version,
                 "and",
                 R.Version()$version.string)
    end.rcode--> 

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
  <!--
  Design by Free CSS Templates
http://www.freecsstemplates.org
Released for free under a Creative Commons Attribution 2.5 License

Name       : Emerald 
Description: A two-column, fixed-width design with dark color scheme.
Version    : 1.0
Released   : 20120902

-->
  <html xmlns="http://www.w3.org/1999/xhtml">
  <head>
  <meta name="keywords" content="" />
  <meta name="description" content="" />
  <meta http-equiv="content-type" content="text/html; charset=utf-8" />
  <title>Univariate Filters</title>
  <link href='http://fonts.googleapis.com/css?family=Abel' rel='stylesheet' type='text/css'>
  <link href="style.css" rel="stylesheet" type="text/css" media="screen" />
  </head>
  <body>
  <div id="wrapper">
  <div id="header-wrapper" class="container">
  <div id="header" class="container">
  <div id="logo">
  <h1><a href="#">Feature Selection using Univariate Filters</a></h1>
</div>
  <!--
  <div id="menu">
  <ul>
  <li class="current_page_item"><a href="#">Homepage</a></li>
<li><a href="#">Blog</a></li>
<li><a href="#">Photos</a></li>
<li><a href="#">About</a></li>
<li><a href="#">Contact</a></li>
</ul>
  </div>
  -->
  </div>
  <div><img src="images/img03.png" width="1000" height="40" alt="" /></div>
  </div>
  <!-- end #header -->
<div id="page">
  <div id="content">

<h1>Contents</h1>  
<ul>
  <li><a href="#filter">Univariate Filters</a></li> 
  <li><a href="#syntax">Basic Syntax</a></li> 
  <li><a href="#example">The Example</a></li>   
 </ul>   
 
<div id="filter"></div>
<h1>Univariate Filters</h1>

<p>
Another approach to feature selection is to pre-screen the predictors using simple univariate statistical methods then only use those that pass some criterion in the subsequent model steps. Similar to recursive selection, cross-validation of the subsequent models will be biased as the remaining predictors have already been evaluate on the data set. Proper performance estimates via resampling should include the feature selection step.
</p>

<p>
As an example, it has been suggested for classification models, that predictors can be filtered by conducting some sort of <i>k</i>-sample test (where <i>k</i> is the number of classes) to see if the mean of the predictor is different between the classes. Wilcoxon tests, <i>t</i>-tests and ANOVA models are sometimes used. Predictors that have statistically significant differences between the classes are then used for modeling.
</p>

<p>
The caret function <span class="mx funCall">sbf</span> (for selection by filter) can be used
to cross-validate such feature selection schemes. Similar to
<span class="mx funCall">rfe</span>, functions can be passed into <span class="mx funCall">sbf</span> for the
computational components: univariate filtering, model fitting, prediction and performance summaries (details are given below).
</p>

<p>
The function is applied to the entire training set and also to different resampled versions of the data set. From this, generalizable estimates of performance can be computed that properly take into account the feature selection step. Also, the results of the predictor filters can be tracked over resamples to understand the uncertainty in the filtering.
</p>

<div id="syntax"></div>
<h1>Basic Syntax</h1>

<p>
Similar to the <span class="mx funCall">rfe</span> function, the syntax for <span class="mx funCall">sbf</span> is:
</p>

<!--begin.rcode,eval=FALSE 
sbf(predictors, outcome, sbfControl = sbfControl(), ...)
## or
sbf(formula, data, sbfControl = sbfControl(), ...)
    end.rcode-->

<p>
In this case, the details are specificed using the <span class="mx funCall">sbfControl</span>
function. Here, the argument <span class="mx arg">functions</span> dictates what the
different components should do. This argument should have elements
called <code>filter</code>, <code>fit</code>, <code>pred</code> and <code>summary</code>.
</p>

<b><p><i>The <span class="mx funCall">score</span> Function</p></i></b>

<p>
This function takes as inputs the predictors and the outcome in
objects called <span class="mx arg">x</span> and <span class="mx arg">y</span>, respectively. By default, each predictor in <span class="mx arg">x</span> is passed to the <span class="mx funCall">score</span> function individually. In this case, the function should return a single score. Alternatively, all the predictors can be exposed to the function using the <span class="mx arg">multivariate</span> argument to <span class="mx funCall">sbfControl</span>. In this case, the output should be a named vector of scores where the names correspond to the
column names of <span class="mx arg">x</span>.  
</p>

<p>
There are two built-in functions called <span class="mx funCall">anovaScores</span> and
<span class="mx funCall">gamScores</span>. <span class="mx funCall">anovaScores</span> treats the outcome as the
independent variable and the predictor as the outcome. In this way,
the null hypothesis is that the mean predictor values are equal across
the different classes. For regression, <span class="mx funCall">gamScores</span> fits a smoothing spline in
the predictor to the outcome  using a generalized additive model and tests to see if there is any
functional relationship between the two. In each function the
p-value is used as the score.
</p>

<b><p><i>The <span class="mx funCall">filter</span> Function</p></i></b>

<p>
This function takes as inputs the scores coming out of the
<span class="mx funCall">score</span> function (in an argument called <span class="mx arg">score</span>). The
function also has the training set data as inputs (arguments are called <span class="mx arg">x</span> and <span class="mx arg">y</span>). 
The output
should be a named logical vector where the names correspond to the
column names of <code>x</code>. Columns with values of <tt><!--rinline 'TRUE' --></tt> will
be used in the subsequent model.
</p>


<b><p><i>The <span class="mx funCall">fit</span> Function</p></i></b>

<p>
The component is very similar to the <span class="mx funCall">rfe</span>-specific function described above. For <span class="mx funCall">sbf</span>, there are no <span class="mx arg">first</span> or
<span class="mx arg">last</span> arguments. The function should have arguments
<span class="mx arg">x</span>, <span class="mx arg">y</span> and <span class="mx arg">...</span>. The data within <code>x</code>
have been filtered using the <span class="mx funCall">filter</span> function described
above. The output of the <span class="mx funCall">fit</span> function should be a fitted model.
</p>


<p>
With some data sets, no predictors will survive the filter. In these cases, a model with predictors cannot be computed, but the lack of viable predictors should not be ignored in the final results. To account for this issue, <a href="http://cran.r-project.org/web/packages/caret/index.html"><strong>caret</strong></a> contains a model function called <span class="mx funCall">nullModel</span> that fits a simple model that is independent of any of the predictors. For problems where the outcome is numeric, the function predicts every sample using the simple mean of the training set outcomes. For classification, the model predicts all samples using the most prevalent class in the training data.
</p>

<p>
This function can be used in the <code>fit</code> component function to
``error-trap" cases where no predictors are selected. For example,
there are several built-in functions for some models. The object
<code>rfSBF</code> is a set of functions that may be useful for fitting
random forest models with filtering. The <span class="mx funCall">fit</span> function here
uses <span class="mx funCall">nullModel</span> to check for cases with no predictors:
</p>

<!--begin.rcode select_sbf_fit
rfSBF$fit
    end.rcode--> 

<b><p><i>The <span class="mx funCall">summary</span> and <span class="mx funCall">pred</span> Functions</p></i></b>

<p>
The <span class="mx funCall">summary</span> function is used to calculate model performance on held-out samples. The <span class="mx funCall">pred</span> function is used to predict new samples using the current
predictor set. The arguments and outputs for these two functions are identical to the previously discussed <span class="mx funCall">summary</span> and <span class="mx funCall">pred</span> functions in previously described sections.
</p>

<div id="example"></div>
<h1>The Example</h1>

<p>
Returning to the example from (Friedman, 1991), we can fit another
random forest model with the predictors pre-filtered using the
generalized additive model approach described previously.
</p>


<!--begin.rcode select_load_sim,echo=FALSE,message=FALSE
library(mlbench)
n <- 100
p <- 40
sigma <- 1
set.seed(1)
sim <- mlbench.friedman1(n, sd = sigma)
colnames(sim$x) <- c(paste("real", 1:5, sep = ""),
                     paste("bogus", 1:5, sep = ""))
bogus <- matrix(rnorm(n * p), nrow = n)
colnames(bogus) <- paste("bogus", 5+(1:ncol(bogus)), sep = "")
x <- cbind(sim$x, bogus)
y <- sim$y
normalization <- preProcess(x)
x <- predict(normalization, x)
x <- as.data.frame(x)
subsets <- c(1:5, 10, 15, 20, 25)
    end.rcode--> 

<!--begin.rcode select_sbf_rf,cache=TRUE
filterCtrl <- sbfControl(functions = rfSBF, 
                         method = "repeatedcv", repeats = 5)
set.seed(10)
rfWithFilter <- sbf(x, y, sbfControl = filterCtrl)
rfWithFilter
    end.rcode--> 

<p>
In this case, the training set indicated that
<!--rinline I(length(predictors(rfWithFilter))) --> should be used in the random
forest model, but the resampling results indicate that there is some
variation in this number. Some of the informative predictors are
used, but a few others are erroneous retained.
</p>

<p>
Similar to <span class="mx funCall">rfe</span>, there are methods for <span class="mx funCall">predictors</span>,
<span class="mx funCall">densityplot</span>, <span class="mx funCall">histogram</span> and <span class="mx funCall">varImp</span>.
</p>

<!-- ------------------------ end #content------------------------ -->

<div style="clear: both;">&nbsp;</div>
  </div>
  <!-- end #content -->
<div id="sidebar">
<ul>
  <li>
    <h2>General Topics</h2>
    <ul>
      <li><a href="index.html">Front Page</a></li>
      <li><a href="visualizations.html">Visualizations</a></li>
      <li><a href="preprocess.html">Pre-Processing</a><li>
      <li><a href="splitting.html">Data Splitting</a></li>
      <li><a href="varimp.html">Variable Importance</a></li>
      <li><a href="other.html">Model Performance</a></li>
      <li><a href="parallel.html">Parallel Processing</a></li>
    </ul>
    <h2>Model Training and Tuning</h2>
    <ul>
      <li><a href="training.html">Basic Syntax</a></li>
      <li><a href="modelList.html">Sortable Model List</a></li>
      <li><a href="bytag.html">Models By Tag</a></li>
      <li><a href="similarity.html">Models By Similarity</a></li>
      <li><a href="custom_models.html">Using Custom Models</a></li>
      <li><a href="sampling.html">Sampling for Class Imbalances</a></li> 
      <li><a href="random.html">Random Search</a></li> 
      <li><a href="adaptive.html">Adaptive Resampling</a></li> 
    </ul>
    <h2>Feature Selection</h2>
    <ul>
      <li><a href="featureselection.html">Overview</a>
      <li><a href="rfe.html">RFE</a></li>
      <li><a href="filters.html">Filters</a></li>
      <li><a href="GA.html">GA</a></li>
      <li><a href="SA.html">SA</a></li>
    </ul>  
  </li>
</ul>
</div>
<!-- end #sidebar -->
<div style="clear: both;">&nbsp;</div>
  </div>
  <div class="container"><img src="images/img03.png" width="1000" height="40" alt="" /></div>
  <!-- end #page -->
</div>
  <div id="footer-content"></div>
  <div id="footer">
<!--begin.rcode echo = FALSE
knit_hooks$set(inline = hook_inline)    
    end.rcode-->   
  <p>Created on <!--rinline I(session) -->.</p>
  </div>
  <!-- end #footer -->
</body>
  </html>
